Method and Apparatus for Monitoring a Status of Nodes of a Communication Network

ABSTRACT

The present invention relates to a method and apparatus for monitoring a status of nodes of a communication network. The method determines first node status data at a first node by diagnosing the own status of the first node and the status of at least one second node, sends the first node status data to at least one second node, receives second node status data from at least one second node, and determines node status evaluation data at the first node based on the determined first node status data and the received second node status data. Furthermore, improvements are proposed for the sake of efficiency and/or robustness of the method.

This application is a continuation of U.S. application Ser. No. 12/466,088, filed May 14, 2009, and claims priority from EP patent application no. 08009078.0, filed May 16, 2008, the entire disclosures of which are hereby incorporated by reference herein.

The present invention relates to a method and apparatus for monitoring a status of nodes of a communication network.

Such a method and apparatus are known, for example, from EP 1 769 993 A2, which is herewith incorporated by reference. The present invention is based on this prior art document and provides improvements.

EP 1 769 993 A2 describes a highly dependable communication network used by a vehicle control system, for which an example is given in FIG. 1 hereof. The vehicle control system shown in FIG. 1 is equipped with an integrated vehicle motion control ECU 110 which centrally controls the vehicle motion by interpreting an intention of the driver. It does so based on signals from the sensors which detect the driver's requests, namely a steering angle sensor Sen1 which measures a rotation angle of a steering wheel 141, a brake pedal position sensor Sen2 which measures a depression of a brake pedal 152, and an accelerator pedal position sensor Sen3 which measures a depression of an accelerator pedal, and based on signals from sensors (e.g., an acceleration sensor, yaw rate sensor, and wheel speed sensor: not shown) which detect vehicle conditions. These components constitute nodes, being connected to the main network Net100A. In the sense of the present patent application, node means (but is not limited to) any entity that is connected to a network and is able to send and/or receive data.

An SBW (Steer-by-Wire) VGR (Variable Gear Ratio) driver ECU 111 which controls a steering motor M1 and a motor M5, an SBW driver ECU 112 which controls a steering motor M2, BBW (Brake-by-Wire) driver ECUs 113A to 113D which control brake motors M3A to M3D, an integrated DBW (Drive-by-Wire) control ECU 120 which centrally controls a drive system of the vehicle, and EAS (Electronic Active Suspension) driver ECUs 114A to 114D which control suspension motors M4A to M4D are connected as actuator driving nodes to the main network Net100A. The steering motor M1 generates a front-wheel steering force, the motor M5 acts on a variable gear ratio (VGR) mechanism mounted on a steering column, the steering motor M2 generates a rear-wheel steering force, the brake motors M3A to M3D generate braking forces for the four wheels, and the suspension motors M4A to M4D adjust damping forces.

Furthermore, the main network Net100A is connected with a millimeter wave radar/camera Sen4 which detects conditions outside the vehicle and an airbag ECU 115 which controls airbag deployment. The backup network Net100B is connected with the minimum nodes required for safe running of the vehicle: namely, the steering angle sensor Sen1, brake pedal position sensor Sen2, SBW VGR driver ECU 111, and BBW driver ECUs 113A to 113D.

The integrated DBW control ECU 120 is connected with an engine control ECU 121, transmission control ECU 122, motor control ECU 123, and battery control ECU 124 via a network Net101. The integrated vehicle motion control ECU 110 is connected with an information gateway 130 and body system gateway 140 via a network Net102 and exchanges data with these devices, where the information gateway 130 provides a gateway into a network which controls car navigation and other information devices and the body system gateway 140 provides a gateway into a network which controls body-related devices such as door locks, side mirrors, and various meters. Although not shown, the airbag ECU 115 is connected at another end with a safety-related network which integrates various sensors and actuators needed for airbag deployment control.

According to this example, the integrated vehicle motion control ECU 110 calculates the steering angle, braking forces, driving force, and the like for vehicle travel control based on signals from the steering angle sensor Sen1, brake pedal position sensor Sen2, accelerator pedal position sensor Sen3, and sensors (e.g., an acceleration sensor, yaw rate sensor, and wheel speed sensor: not shown) which detect vehicle conditions. Then, it gives steering angle commands to the front-wheel SBW VGR driver ECU 111 and rear-wheel SBW driver ECU 112, braking force commands to the BBW driver ECUs 113A to 113D of the four wheels, and a driving force command to the integrated DBW control ECU 120. Upon receiving the driving force command, the integrated DBW control ECU 120 calculates the driving forces which power sources such as an engine and motors should generate, taking energy efficiency into consideration, and transmits the resulting driving force commands to the engine control ECU 121, motor control ECU 123, and the like via a network. By using not only the information from the sensors which detect the driver's requests, but also information from the radar/camera Sen4 which detects the conditions outside the vehicle, the integrated vehicle motion control ECU 110 can perform control such as trail driving, lane-keeping driving, and risk-averse driving.

In such a safety critical vehicle control system, in case a certain node of the communication network fails, the remaining nodes have to execute a backup control using the information as to which node has failed. Therefore, it is essential to identify failed nodes accurately and to ensure consistency of information as to which node has failed among all remaining nodes by communicating through the network. As a consequence, a node status monitoring functionality is needed.

In the following, the approach taken by EP 1 769 993 A2 will be described with reference to FIGS. 2 and 3. The vehicle control system shown in FIG. 2 consists of multiple nodes, namely node 1 (N1), node 2 (N2), . . . , node n (Nn), which are connected via a network Net100. The nodes are processing units connected to a network and capable of communicating information via the network. Specifically, they include various electronic control units, actuator drivers, and sensors mounted on a vehicle. The network Net100 is a communication network capable of multiplex communication as well as broadcasting, which involves transmitting the same content simultaneously from a node to all the other nodes connected to the network.

Each node, N1, N2, . . . , or Nn (hereinafter referred to as Nx) has a node status determination section x1 (11, 21, . . . , or n1), a status evaluation result transmitting/receiving section x2 (12, 22, . . . , or n2), and a failed-node identification section x3 (13, 23, . . . , or n3), where character x is a node number (1, 2, . . . , n) and the same applies hereinafter.

The node status determination section x1 (11, 21, . . . , n1) has an own-node status evaluation section x102 (1102, 2102, . . . , n102) which determines the status of the given node itself and an other-node status evaluation section x101 (1101, 2101, . . . , n101) which determines the status of the other nodes in the same network. The “own-node status” is a self-diagnostic result of the own node while the “other-node status” is status regarding whether or not data sent from each of the other nodes is correct as viewed by the judging node. For example, in order for node 1 to determine that the “other-node status” of node 2 is normal, hardware of node 2 must operate normally, arithmetic processing in node 2 must be performed normally, and communication from node 2 to node 1 must be conducted without error. It is conceivable, for example, to use a configuration in which all nodes broadcast serial number data which is incremented in each communication cycle. In that case, if the serial number data received from a node is not incremented, it can be determined that the “other-node status” of the node is abnormal.
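The serial-number check mentioned above can be sketched as follows. This is only a minimal illustration: the function name, the 8-bit counter width, and the treatment of a missing frame as abnormal are assumptions and are not taken from EP 1 769 993 A2.

```python
SERIAL_MODULUS = 256  # assumption: 8-bit serial number that wraps around


def other_node_status_normal(previous_serial, received_serial):
    """Return True if the serial number advanced by one since the last cycle.

    received_serial is None when no frame arrived in this cycle, which this
    sketch treats as abnormal (an assumption).
    """
    if received_serial is None:
        return False
    return received_serial == (previous_serial + 1) % SERIAL_MODULUS


print(other_node_status_normal(41, 42))   # serial number was incremented
print(other_node_status_normal(41, 41))   # serial number was not incremented
print(other_node_status_normal(255, 0))   # wrap-around counts as incremented
```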

The status evaluation result transmitting/receiving section x2 (12, 22, . . . , n2) has a data transmission section x203 (1203, 2203, . . . , n203) which transmits the node status (node status determined by the own node) determined by the node status determination section x1 (11, 21, . . . , n1) to the other nodes, a reception processor x202 (1202, 2202, . . . , n202) which receives node status as determined by the other nodes, and a status evaluation result storage section x201 (1201, 2201, . . . , n201) which stores the node evaluation result made by the own node and the node evaluation results made by the other nodes.

The failed-node identification section x3 (13, 23, . . . , or n3) identifies a failed node based on the node evaluation result made by the own node as well as on the node evaluation results made by the other nodes and received by the status evaluation result transmitting/receiving section x2.

As shown in FIG. 3, every node connected to the network Net100 has a status evaluation result storage section x201 (1201, 2201, . . . , n201), which contains both evaluation result data of the own node and of the other nodes. For example, node 1's result of the status evaluation with respect to the other nodes, which is stored in a buffer 141 in the status evaluation result storage section 1201 of node 1, is transmitted to all the other nodes via the network Net100 and is stored in the buffers x41 (241, 341, . . . , n41) in the status evaluation result storage sections x201. The same applies to the transmission and reception of the status evaluation results conducted by nodes 2 to n.

Each node finally determines the status of other nodes by voting on the evaluation result data. For example, even if node 1 determines that node 2 is faulty, if the other nodes (nodes 2 to n) determine that node 2 is normal, each node can correctly determine that node 1 rather than node 2 is faulty.
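The voting step can be illustrated with a small sketch. The strict-majority rule, the handling of ties as faulty, and the reporting of dissenting observers are assumptions made for illustration; EP 1 769 993 A2 may define the vote differently.

```python
def vote(matrix):
    """Vote on an n x n evaluation matrix.

    matrix[i][j] is True if node i judged node j to be normal.
    Returns (status, dissenters): status[j] is True if a strict majority
    judged node j normal; dissenters lists observers whose opinion
    contradicts the majority for at least one node.
    """
    n = len(matrix)
    status = []
    for j in range(n):
        normal_votes = sum(1 for i in range(n) if matrix[i][j])
        status.append(normal_votes * 2 > n)  # a tie counts as faulty (assumption)
    dissenters = sorted(
        {i for i in range(n) for j in range(n) if matrix[i][j] != status[j]}
    )
    return status, dissenters


# Four nodes (indices 0..3 stand for nodes 1..4); only node 1 judges node 2 faulty.
views = [[True] * 4 for _ in range(4)]
views[0][1] = False
print(vote(views))
```

In this example the majority keeps node 2 as normal and the only dissenting observer is node 1, which matches the situation described above.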

One problem of this approach is that the execution time of the voting process increases in proportion to the square of the number of nodes because the status evaluation result storage section x201 contains n×n bits for an n-node system in case each evaluation result consists of n bits as shown in FIG. 4. Moreover, the required communication bandwidth for every evaluation result is proportional to the number of nodes.

When the nodes of the network detect that a certain node is faulty, the question arises as to when an application using the network should be notified of the failure. If the notification is executed quickly, it may happen that the application is informed about transient failures that vanish after a very short period of time. However, if the failure is communicated to the application too late, dangerous situations may occur, especially when safety critical components like the brake are faulty. Therefore, it is necessary to find a balance between a quick notification of the application and the requirement that only relevant failures are reported to the application.

Therefore, EP 1 769 993 A2 uses a failure counter, which is incremented in each communication cycle in which a node is faulty, and a failure counter threshold. When the failure counter threshold is exceeded by the failure counter, the application is notified of the failure.

Furthermore, as every node performs the evaluation whether a node is faulty or not on its own, it may happen that the nodes notify the application at different points in time, which is clearly undesirable. Therefore, a synchronisation between the nodes is needed.

FIG. 5 shows an example to illustrate how EP 1 769 993 A2 handles this problem. In an early communication cycle it has been determined that node 3 is faulty. In the communication cycle i the failure counters of nodes 1 and 4 are incremented to 8. Due to an error, node 2 is behind and has a failure counter with the value 5. In each communication cycle node status data and failure notification synchronisation flags are sent among the nodes (transmit frame). In the communication cycle i+1 the failure counters are incremented again. In the communication cycle i+2 the failure counters of nodes 1 and 4 reach the failure counter threshold of 10. In this situation, in the frames transmitted from node 1 and node 4 to node 2 the failure notification synchronisation flag for node 3 is set. When node 2 receives the frame, it adjusts its own failure counter to 10 and also notifies the application of the failure of node 3 like nodes 1 and 4.
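The FIG. 5 sequence can be replayed with a short sketch. The function name and the dictionary representation are hypothetical, and the frame exchange is reduced to a single boolean flag; only the counter values and the threshold of 10 come from the example above.

```python
FAILURE_THRESHOLD = 10  # threshold value used in the FIG. 5 example


def run_cycle(counters):
    """Advance every node's failure counter for the faulty node by one cycle.

    counters maps an observing node to its failure counter. Returns the
    updated counters and whether a synchronisation flag was sent this cycle.
    """
    counters = {node: value + 1 for node, value in counters.items()}
    flag_sent = any(value >= FAILURE_THRESHOLD for value in counters.values())
    if flag_sent:
        # A node that lags behind adopts the threshold value on receiving the flag.
        counters = {
            node: max(value, FAILURE_THRESHOLD) for node, value in counters.items()
        }
    return counters, flag_sent


state = {"node 1": 8, "node 2": 5, "node 4": 8}  # values in communication cycle i
state, flag = run_cycle(state)  # communication cycle i+1
state, flag = run_cycle(state)  # communication cycle i+2
print(state, flag)
```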

This approach may have the problem that it takes a long time to reach the threshold value when the threshold value is large. Depending on the failure rate of the overall system, the failure counter value of the majority of nodes might be corrupted before the threshold value is reached. In this case an agreement on the failure notification timing cannot be achieved.

A further prior art document is the article “A Tunable Add-On Diagnostic Protocol for Time-Triggered Systems” by Marco Serafini et al. published in the proceedings of the IEEE International Conference on Dependable Systems and Networks (DSN), 2007. The article proposes a method that accumulates the information on detected faults using a penalty/reward algorithm to handle transient faults.

Based on the prior art, it is an object of the present invention to provide an efficient method and apparatus for monitoring a status of nodes of a communication network.

It is a further object to provide a method and apparatus for monitoring a status of nodes of a communication network with improved robustness.

Furthermore, it is an object of the present invention to provide a method and apparatus for monitoring a status of nodes of a communication network with short notification times having an acceptable behaviour with regard to transient faults.

At least one object is accomplished by the independent claims. Preferred embodiments are specified in the dependent claims.

The invention comprises a method for monitoring a status of nodes of a communication network comprising the steps of

-   dividing the communication network into clusters of nodes,
-   determining first node status data at each node of the communication network by diagnosing the own status of the determining node and the status of the other nodes of the communication network,
-   sending first node status data relating to the nodes of the cluster of the determining node from the determining node to the other nodes of the communication network,
-   receiving second node status data relating to the nodes of the cluster of the sending node from the other nodes of the communication network, and
-   determining node status evaluation data for the nodes of the communication network based on the determined first node status data and the second node status data received from the other nodes of the communication network.

Since only the first node status data relating to the nodes of the cluster of the determining node is sent and only the second node status data relating to the nodes of the cluster of the sending node is received, less communication bandwidth is needed in comparison to the state of the art. Furthermore, the time needed to carry out the method is reduced.

In some embodiments, the method may comprise the steps of

-   receiving at a receiving node second node status data from a node of a cluster to which the receiving node does not belong,
-   determining at the receiving node first node status data by diagnosing the status of at least one node of said cluster,
-   determining whether or not the first node status data is consistent with the second node status data, and
-   diagnosing the status of the receiving node as faulty, if it is determined that the first node status data is inconsistent with the second node status data.

Due to the clustering approach, the view that the nodes of a certain cluster have may deviate from the view that a node outside of the cluster has. In a situation in which all nodes of a cluster determine that a certain node of the cluster is working correctly, while the node outside of the cluster determines that the certain node is faulty, the invention assumes that the node outside of the cluster is faulty. This is a simple, yet efficient approach for resolving inconsistencies.
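The resolution rule can be written down in a few lines. This is a minimal sketch from the point of view of the node outside the cluster; the function and parameter names are hypothetical, and cases in which the cluster itself is not unanimous are simply left undecided here because the text above does not cover them.

```python
def outside_node_is_faulty(own_view_correct, cluster_views_correct):
    """Apply the inconsistency rule at a node outside the cluster.

    own_view_correct: the outside node's own diagnosis of the observed node.
    cluster_views_correct: diagnoses of the same node reported by the
        nodes of the cluster (True means "working correctly").
    Returns True if the outside node should diagnose itself as faulty.
    """
    cluster_agrees_correct = bool(cluster_views_correct) and all(cluster_views_correct)
    return cluster_agrees_correct and not own_view_correct


# All cluster nodes see the observed node as correct, the outside node does not.
print(outside_node_is_faulty(False, [True, True, True]))
```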

Furthermore, the invention comprises a method for monitoring a status of nodes of a communication network comprising the steps of

-   determining first node status data at a first node by diagnosing the own status of the first node and the status of at least one second node,
-   sending the first node status data to at least one second node,
-   receiving second node status data from at least one second node, and
-   determining node status evaluation data at the first node based on the determined first node status data and the received second node status data,
-   wherein the sending and receiving is periodically performed in communication rounds.

In the methods according to the invention, a node may be a first node and a second node at the same time depending on the role it adopts. The first node status data of the first node is received on the second node as second node status data.

The method may further comprise the step of defining a group of nodes to which the first node belongs. The group may comprise all the nodes of the network, the nodes of a cluster, the nodes of a plurality of clusters, a subset of the nodes of a cluster or any other number of nodes.

Based on the determined node status evaluation data, it may be determined that a certain node is faulty. In this case, a first failure counter is initialized for the certain node, the first failure counter is incremented in each communication round in which the node status evaluation data indicate that the certain node is faulty, the first failure counter is sent to the other nodes of the group, and second failure counters for the certain node are received from the other nodes of the group.

According to one aspect of the invention, the following steps are furthermore carried out: determining a failure counter value that most of the failure counters of the nodes of the group have, and adjusting the first failure counter if the determined failure counter value is different from the value of the first failure counter.

Since it is determined which value most of the failure counters of the nodes of the group have, it becomes possible to adjust the first failure counter if it is likely that an error occurred on the node having the first failure counter. In this way, the participating nodes of the group establish a common opinion of the currently correct failure counter value. As a consequence, a corruption of the failure counters of the nodes of the group is prevented.
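A possible reading of this adjustment step is sketched below. The helper name is hypothetical, and requiring a strict majority before the own counter is overwritten is an assumption; the text only says that the value held by most of the failure counters of the group is determined.

```python
from collections import Counter


def adjust_failure_counter(own_counter, received_counters):
    """Return the failure counter value the node should continue with.

    own_counter: the first failure counter kept by this node.
    received_counters: second failure counters received from the other
        nodes of the group for the same faulty node.
    """
    values = [own_counter] + list(received_counters)
    value, count = Counter(values).most_common(1)[0]
    if count * 2 > len(values):  # assumption: adopt only a strict majority value
        return value
    return own_counter


print(adjust_failure_counter(5, [8, 8]))  # the own counter lags behind the group
print(adjust_failure_counter(8, [8, 5]))  # the own counter already agrees
```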

In some embodiments, the method further comprises the step of notifying an application that uses the communication network of a fault of the certain node, if the first failure counter reaches a predetermined threshold value.

This allows notifying an application of a fault in the case that it is sufficiently likely that a relevant fault is present in the network.

Furthermore, the invention comprises a method for monitoring a status of nodes of a communication network comprising the steps of

-   determining first node status data at a first node by diagnosing the own status of the first node and the status of at least one second node,
-   sending the first node status data to at least one second node,
-   receiving second node status data from at least one second node, and
-   determining node status evaluation data at the first node based on the determined first node status data and the received second node status data,

wherein the sending and receiving is periodically performed in communication rounds. Since multiple nodes are determining node status data, a node A may be a second node for a node B, while node B is a second node for node A.

Based on the node status evaluation data, it may be determined that a certain node is faulty. According to one aspect of the invention, when the certain node is detected as faulty after a predetermined number of communication rounds in which the certain node was working correctly, an outage counter is initialized, which is incremented in each communication round after the initialization of the outage counter.

The method may further comprise the step of notifying an application that uses said communication network of a fault of the certain node, if the outage counter reaches a predetermined threshold value. Through the use of the outage counter, a quick notification of the application may be achieved.

In a preferred embodiment, the method may further comprise the step of resetting the outage counter, when it is detected that the certain node was working correctly in a predetermined number of previous communication rounds.

In this way, the outage counter is reset in the case that it is sufficiently likely that the outage counter was initialized due to a transient fault. As a consequence, it is achieved with a high likelihood that the application is not notified of transient faults.
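The behaviour of the outage counter over successive communication rounds can be sketched as a small state machine. The class name, the parameter names, the initial value of 0, and the assumption that a node counts as healthy at start-up are illustrative choices only; the text above fixes none of them.

```python
class OutageCounter:
    """Per-node outage counter, updated once per communication round."""

    def __init__(self, init_ok_rounds, reset_ok_rounds, threshold):
        self.init_ok_rounds = init_ok_rounds    # correct rounds required before initialization
        self.reset_ok_rounds = reset_ok_rounds  # correct rounds that trigger a reset
        self.threshold = threshold              # value at which the application is notified
        self.ok_streak = init_ok_rounds         # assumption: node is healthy at start-up
        self.value = None                       # None means "not initialized"

    def update(self, faulty):
        """Process one round; return True if the application should be notified."""
        if self.value is not None:
            self.value += 1                     # incremented in every round once initialized
        elif faulty and self.ok_streak >= self.init_ok_rounds:
            self.value = 0                      # assumption: counter starts at 0
        self.ok_streak = 0 if faulty else self.ok_streak + 1
        if self.value is not None and self.ok_streak >= self.reset_ok_rounds:
            self.value = None                   # fault was probably transient
        return self.value is not None and self.value >= self.threshold


permanent = OutageCounter(init_ok_rounds=2, reset_ok_rounds=2, threshold=3)
print([permanent.update(f) for f in [True, True, True, True]])

transient = OutageCounter(init_ok_rounds=2, reset_ok_rounds=2, threshold=3)
print([transient.update(f) for f in [True, False, False]])
```

With these hypothetical parameters, a fault that persists leads to a notification in the fourth round, while a fault that lasts a single round is cleared by the reset before the threshold is reached.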

In the above described methods according to the invention, the nodes may be in-vehicle devices of a vehicle control system. In some embodiments of the methods according to the invention, the status of a node is either correct or faulty.

The invention furthermore comprises an apparatus for monitoring a status of nodes of a communication network comprising

-   means for dividing the communication network into clusters of nodes,
-   means for determining first node status data at each node of the communication network by diagnosing the own status of the determining node and the status of the other nodes of the communication network,
-   means for sending first node status data relating to the nodes of the cluster of the determining node from the determining node to the other nodes of the communication network,
-   means for receiving second node status data relating to the nodes of the cluster of the sending node from the other nodes of the communication network, and
-   means for determining node status evaluation data for the nodes of the communication network based on the determined first node status data and the second node status data received from the other nodes of the communication network.

This apparatus may have the same advantages as the corresponding method according to the invention.

In some embodiments, the apparatus further comprises

-   means for receiving at a receiving node second node status data from a node of a cluster to which the receiving node does not belong,
-   means for determining at the receiving node first node status data by diagnosing the status of at least one node of said cluster,
-   means for determining whether or not the first node status data is consistent with the second node status data, and
-   means for diagnosing the status of the receiving node as faulty, if it is determined that the first node status data is inconsistent with the second node status data.

With these means, inconsistencies can be easily resolved.

Moreover, the invention comprises an apparatus for monitoring a status of nodes of a communication network comprising

-   means for determining first node status data at a first node by diagnosing the own status of the first node and the status of at least one second node,
-   means for sending the first node status data to at least one second node,
-   means for receiving second node status data from at least one second node, and
-   means for determining node status evaluation data at the first node based on the determined first node status data and the received second node status data,

wherein the sending and receiving is periodically performed in communication rounds.

The apparatus may furthermore comprise means for defining a group of nodes to which the first node belongs.

In addition, the apparatus may comprise

-   means for determining that a certain node is faulty based on the determined node status evaluation data,
-   means for initializing a first failure counter for the certain node,
-   means for incrementing the first failure counter in each communication round in which the node status evaluation data indicate that the certain node is faulty,
-   means for sending the first failure counter to the other nodes of the group, and
-   means for receiving second failure counters for the certain node from the other nodes of the group.

According to one aspect of the invention, the apparatus furthermore comprises means for determining a failure counter value that most of the failure counters of the nodes of the group have, and means for adjusting the first failure counter, if the determined failure counter value is different from the value of the first failure counter.

With this apparatus, it becomes possible to prevent the failure counters of the nodes from becoming corrupt.

The apparatus may furthermore comprise means for notifying an application that uses said communication network of a fault of the certain node, if the first failure counter reaches a predetermined threshold value.

In addition, the invention comprises an apparatus for monitoring a status of nodes of a communication network comprising

-   means for determining first node status data at a first node by diagnosing the own status of the first node and the status of at least one second node,
-   means for sending the first node status data to at least one second node,
-   means for receiving second node status data from at least one second node, and
-   means for determining node status evaluation data at the first node based on the determined first node status data and the received second node status data,

wherein the sending and receiving is periodically performed in communication rounds.

Furthermore, the apparatus may comprise

-   means for determining that a certain node is faulty based on the node status evaluation data,
-   means for initializing an outage counter when the certain node is detected as faulty after a predetermined number of communication rounds in which the certain node was working correctly, and
-   means for incrementing the outage counter in each communication round after the initialization of the outage counter.

In addition, the apparatus may comprise means for notifying an application that uses said communication network of a fault of the certain node, if the outage counter reaches a predetermined threshold value.

In a preferred embodiment, the apparatus may comprise means for resetting the outage counter, when it is detected that the certain node was working correctly in a predetermined number of previous communication rounds.

Based on the outage counter, a quick notification of the application without reporting too many transient faults to the application may be accomplished.

In some embodiments of the apparatus according to the invention, the nodes may be in-vehicle devices of a vehicle control system. In some embodiments, the status of a node may be either correct or faulty.

The method according to the invention as well as the apparatus according to the invention may be implemented by a computer program. Therefore, the invention also comprises a computer program product, the computer program product comprising a computer-readable medium and a computer program recorded therein in the form of a series of state elements corresponding to instructions which are adapted to be processed by a data processing means of a data processing apparatus, such that a method according to the invention is carried out or an apparatus according to the invention is formed on the data processing means.

Further embodiments and details of the present invention will be explained in the following with reference to the figures.

FIG. 1 shows a system block diagram of a vehicle control system.

FIG. 2 shows nodes of a communication network according to the prior art.

FIG. 3 shows some aspects of the nodes of a communication network according to the prior art.

FIG. 4 illustrates the amount of evaluation result data that is stored in the status evaluation result storage section in the nodes of a communication network according to the prior art.

FIG. 5 illustrates an approach of the prior art for synchronizing the nodes of a communication network.

FIG. 6 illustrates one embodiment of the apparatus for monitoring a status of nodes of a communication network according to one aspect of the present invention.

FIG. 7 shows the communication network divided into clusters according to one aspect of the invention.

FIG. 8 illustrates the node status data that each node stores in its status evaluation result storage section according to one aspect of the present invention.

FIG. 9 illustrates an embodiment of the method for monitoring a status of nodes of a communication network according to an aspect of the present invention.

FIG. 10 shows an embodiment of the apparatus for monitoring a status of nodes of a communication network according to one aspect of the present invention.

FIG. 12 illustrates the exchange of data between the nodes according to one aspect of the invention.

FIG. 13 shows one embodiment of the apparatus for monitoring a status of nodes of a communication network according to one aspect of the present invention.

FIG. 14 illustrates the behaviour of an outage counter according to one aspect of the invention in comparison to a failure counter.

In the following, one embodiment of the apparatus and the method for monitoring a status of nodes of a communication network according to one aspect of the present invention will be explained with reference to FIGS. 6 to 8. The apparatus for monitoring a status of nodes of a communication network 600 according to the embodiment comprises means for dividing the communication network into clusters of nodes 610 that divide the network into clusters. FIG. 7 shows an example of the clusters. The first node of cluster 1 is denoted 1-1, the second node of cluster 1 is denoted 1-2, and so forth. As can be seen in FIG. 7, n nodes have been logically divided into n/c clusters, wherein each cluster consists of c nodes. Although in the shown example it is assumed that n is divisible by c, the following discussion is also applicable to a system in which n is not divisible by c.
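The logical division into clusters and the node naming used in FIG. 7 can be illustrated as follows. The function name is hypothetical, and placing the remaining nodes in a smaller last cluster when n is not divisible by c is an assumption; the description only states that such systems are also covered.

```python
def make_clusters(n, c):
    """Divide n nodes into clusters of c nodes, named "<cluster>-<node>"."""
    cluster_count = (n + c - 1) // c  # round up so that leftover nodes get a cluster
    clusters = []
    for k in range(cluster_count):
        size = min(c, n - k * c)      # assumption: the last cluster may be smaller
        clusters.append([f"{k + 1}-{i + 1}" for i in range(size)])
    return clusters


print(make_clusters(8, 4))  # n divisible by c: two clusters of four nodes
print(make_clusters(5, 2))  # n not divisible by c: the last cluster holds one node
```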

Means for determining first node status data at each node of the communication network by diagnosing the own status of the determining node and the status of the other nodes of the communication network 620 determine first node status data at each node of the communication network by diagnosing the own status of the determining node and the status of the other nodes of the communication network.

Afterwards, the means for sending first node status data relating to the nodes of the cluster of the determining node from the determining node to the other nodes of the communication network 630 send first node status data relating to the nodes of the cluster of the determining node from the determining node to the other nodes of the communication network, where the first node status data is received as second node status data.

Furthermore, the means for receiving second node status data relating tothe nodes of the cluster of the sending node from the other nodes of thecommunication network 640 receive second node status data relating tothe nodes of the cluster of the sending node from the other nodes of thecommunication network. Based on the determined first node status dataand the second node status data received from the other nodes of thecommunication network, the means for determining node status evaluationdata 650 determines node status evaluation data for the nodes of thecommunication network.

FIG. 8 illustrates the node status data that each node stores in its status evaluation result storage section x201 shown, for example, in FIG. 3. As can be seen from FIG. 8, every node stores c×c node status data items for each cluster, and since there are n/c clusters in the system, the status evaluation result storage and communication bandwidth can be reduced to only n×c bits, which also reduces the execution time of the voting process by a factor of n/c.

The vacant areas shown in FIG. 8 illustrate the savings in computational effort and communication bandwidth in comparison to FIG. 4. Only n×c bits of memory need to be allocated rather than n×n bits in the prior art.

As illustrated in FIG. 8, based on the node status data for a node, a voting process determines node status evaluation data for the node.
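The storage figures above follow from simple arithmetic, which the sketch below reproduces. The function name `status_bits` is hypothetical, one bit per node status data item is assumed, and for brevity the sketch only covers the case in which n is divisible by c.

```python
def status_bits(n: int, c: int) -> tuple[int, int]:
    """Return (bits with clustering, bits without clustering) for n nodes.

    Assumptions: one bit per status item, and n divisible by c.
    """
    if n % c != 0:
        raise ValueError("this sketch assumes n is divisible by c")
    clusters = n // c
    clustered = clusters * c * c   # c x c items for each of the n/c clusters
    full = n * n                   # every node's view of every node
    return clustered, full


if __name__ == "__main__":
    clustered, full = status_bits(12, 4)
    print(clustered, full)         # 48 144
```

For n = 12 and c = 4 the clustered layout needs 48 bits instead of 144, i.e. the n×c versus n×n relation stated above.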

FIG. 9 illustrates an embodiment of the method for monitoring a status of nodes of a communication network according to an aspect of the invention. As shown in FIG. 9, node 2-1 receives second node status data from the nodes 1-1, 1-2, and 1-4 of cluster 1, a cluster to which node 2-1 does not belong. Node 2-1 does not receive any second node status data from node 1-3 and therefore determines that either itself or node 1-3 is faulty. In other words, node 2-1 determines first node status data by diagnosing the status of node 1-3. This first node status data indicates that either node 1-3 or node 2-1 is faulty.

Node 2-1 evaluates the second node status data received from the nodes 1-1, 1-2, and 1-4 as shown on the right side of FIG. 9. “1” represents that the node is correct, “0” denotes that the node is faulty, and “−” denotes that no information has been received. The first bit of the second node status data contains information about the node 1-1, the second bit contains information about the node 1-2, the third bit contains information about the node 1-3, and the fourth bit contains information about the node 1-4. As can be seen on the right side of FIG. 9, node 1-1 thus informs node 2-1 that, according to the view of node 1-1, the nodes 1-2, 1-3, and 1-4 are working correctly.

The node 2-1 evaluates the second node status data and determines that, according to the view of the nodes of the cluster, every node in the cluster is working correctly. However, node 2-1 has not received information from node 1-3, which means that either node 1-3 or node 2-1 is faulty. Since node 2-1 derived from the second node status data that all the nodes of the cluster 1 are working correctly, node 2-1 can conclude that it must itself be faulty. In other words, node 2-1 determines that the first node status data is inconsistent with the second node status data and therefore diagnoses the status of itself as faulty.
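The reasoning of node 2-1 in FIG. 9 can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the use of `None` for "nothing received", the simple majority vote, and the concrete all-ones status vectors are illustrative choices, since the figure itself is not reproduced here.

```python
from typing import Optional

CORRECT, FAULTY = 1, 0


def majority_correct(opinions: list[int]) -> bool:
    """True if a strict majority of the received opinions says 'correct'."""
    return bool(opinions) and sum(opinions) * 2 > len(opinions)


def diagnoses_itself_faulty(rows: dict[str, Optional[list[int]]],
                            members: list[str]) -> bool:
    """Decide whether the observing node should blame itself.

    rows maps each cluster member to the status vector received from it
    (one bit per member, in the order of `members`), or None if nothing
    was received from that member.
    """
    silent = [m for m in members if rows.get(m) is None]
    if not silent:
        return False                      # nothing is inconsistent
    for m in silent:
        idx = members.index(m)
        # Opinions of the members that were heard about the silent member.
        opinions = [row[idx] for row in rows.values() if row is not None]
        if not majority_correct(opinions):
            return False                  # the cluster itself reports m as faulty
    return True                           # cluster says all is well: observer is at fault


if __name__ == "__main__":
    cluster_1 = ["1-1", "1-2", "1-3", "1-4"]
    received = {"1-1": [1, 1, 1, 1], "1-2": [1, 1, 1, 1],
                "1-3": None, "1-4": [1, 1, 1, 1]}
    print(diagnoses_itself_faulty(received, cluster_1))   # True
```

If the members that were heard had instead reported node 1-3 as faulty, the same check would return False and the fault would be attributed to node 1-3 rather than to the observing node.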

In the following, an embodiment of the apparatus and the method for monitoring a status of nodes of a communication network according to one aspect of the present invention will be explained with reference to FIGS. 10 to 12. The shown embodiment of the apparatus 1000 comprises means for determining first node status data at a first node by diagnosing the own status of the first node and the status of at least one second node 1001 that determine first node status data by diagnosing the own status of the first node and the status of at least one second node. Means for sending 1003 send the first node status data to at least one second node. Second node status data from at least one second node is received by the means for receiving 1004.

The apparatus 1000 furthermore comprises means for determining node status evaluation data at the first node based on the determined first node status data and the received second node status data 1002 that determine node status evaluation data based on the determined first node status data and the received second node status data. The sending and receiving is periodically performed in communication rounds. A group of nodes to which the first node belongs is defined by means for defining a group of nodes to which the first node belongs 1005.

The apparatus 1000 furthermore comprises means for determining that a certain node is faulty based on the determined node status evaluation data 1006. When these means determine that a certain node is faulty, the first failure counter for the certain node is initialized to 1 by the means for initializing a first failure counter for the certain node 1007 that is a part of the apparatus 1000. The first failure counter is incremented in each communication round in which the node status evaluation data indicate that the certain node is faulty by means for incrementing the first failure counter 1008. The first failure counter is sent to the other nodes of the group by the means for sending 1003 and the means for receiving 1004 receive second failure counters for the certain node from the other nodes of the group.

The apparatus 1000 furthermore comprises means for determining a failure counter value that most of the failure counters of the nodes of the group have 1009 that determine a failure counter value that most of the failure counters of the nodes of the group have. The means for adjusting the first failure counter 1010 adjust the first failure counter if the determined failure counter value is different from the value of the first failure counter.

The means for notifying an application 1011 notify an application that uses the communication network of a fault of the certain node, if the first failure counter reaches a predetermined threshold value.
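The adjustment of the first failure counter towards the value held by most nodes of the group can be sketched as follows. The function name is hypothetical, and interpreting "most" as a strict majority of all counters in the group (the node's own counter included) is an assumption; the description does not say how ties or a missing majority are to be handled, so the sketch then keeps the own value.

```python
from collections import Counter


def adjust_failure_counter(own: int, received: list[int]) -> int:
    """Return the failure counter value the node should continue with.

    own      -- the first failure counter of this node
    received -- second failure counters received from the other group nodes
    Assumption: a value is adopted only if a strict majority of all
    counters (own included) agree on it; otherwise own is kept.
    """
    counts = Counter([own, *received])
    value, votes = counts.most_common(1)[0]
    if votes * 2 > len(received) + 1:
        return value
    return own


if __name__ == "__main__":
    # A bit inversion turned the local counter 3 into 7; the group corrects it.
    print(adjust_failure_counter(7, [3, 3, 3]))   # 3
```

A node whose counter was corrupted, for example by a bit inversion, is thereby pulled back to the value shared by the other nodes of the group.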

FIG. 11 shows one embodiment of a structure of the data that is sent by the means for sending 1003. The data structure comprises a field for the node status data 1110 that comprises one sub-field for each diagnosed node. Furthermore, a failure notification synchronization flag field 1120 is provided in the shown data structure. Each bit in this failure notification synchronization flag field 1120 indicates whether a certain node is faulty or not, which is determined, for example, by the means 1011. In a failure counter value field 1130, a failure counter value is transmitted.

In this failure counter value field 1130, the failure counter value of a certain node is transmitted based on the communication round. As illustrated in the example of FIG. 12, in round k the failure counter value for node 1-1 is transmitted, whereas in round k+1 the failure counter value for node 1-2 is transmitted. FIG. 12 illustrates the system using a clustering concept according to one aspect of the present invention, such that first failure counter values relating to the nodes of the cluster are sent within the cluster. Nevertheless, the embodiment illustrated in FIGS. 10 to 12 is also applicable to systems without the clustering concept.

Even though more communication bandwidth (namely log2 Pth bits, where Pth is the threshold value of the failure counter value) is required, the failure counter value of nodes that have an incorrect failure counter value (due to, for example, bit inversions, etc.) can be adjusted before the number of nodes with corrupt failure counters increases too much.
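The data structure of FIG. 11 and the round-dependent choice of the transmitted failure counter can be sketched as follows. The class and function names are hypothetical, and the strict round-robin order (round number modulo the number of nodes) is an assumption; the description only states that the selected node depends on the communication round.

```python
from dataclasses import dataclass


@dataclass
class StatusMessage:
    node_status: list[int]      # field 1110: one sub-field per diagnosed node
    failure_flags: list[int]    # field 1120: one flag bit per node
    failure_counter: int        # field 1130: counter of ONE node per round


def counter_subject(round_no: int, members: list[str]) -> str:
    """Node whose failure counter is carried in the given round (assumed round-robin)."""
    return members[round_no % len(members)]


def build_message(round_no: int, members: list[str], node_status: list[int],
                  failure_flags: list[int], counters: dict[str, int]) -> StatusMessage:
    """Assemble the message a node sends in one communication round."""
    subject = counter_subject(round_no, members)
    return StatusMessage(node_status, failure_flags, counters[subject])


if __name__ == "__main__":
    cluster = ["1-1", "1-2", "1-3", "1-4"]
    counters = {"1-1": 0, "1-2": 2, "1-3": 0, "1-4": 0}
    for k in range(2):
        msg = build_message(k, cluster, [1, 0, 1, 1], [0, 0, 0, 0], counters)
        print(k, counter_subject(k, cluster), msg.failure_counter)
```

Sending only one counter per round keeps the additional bandwidth per message constant, at the cost of each counter being compared only once every cycle of rounds.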

In the following, with reference to FIGS. 13 and 14, an embodiment of the apparatus and the method for monitoring a status of nodes of a communication network according to one aspect of the present invention will be explained. The apparatus 1300 shown in FIG. 13 comprises means for determining first node status data 1301 that determine first node status data at a first node by diagnosing the own node status of the first node and the status of at least one second node. The means for sending 1303 send the first node status data to at least one second node and the means for receiving 1304 receive second node status data from at least one second node. Based on the determined first node status data and the received second node status data, means for determining node status evaluation data 1302 determine node status evaluation data at the first node. The sending and receiving performed by the means for sending 1303 and the means for receiving 1304 is periodically performed in communication rounds.

The apparatus 1300 furthermore comprises means for determining that a certain node is faulty based on the node status evaluation data 1305. When the certain node is detected as faulty after a predetermined number of communication rounds in which the certain node was working correctly, the means for initializing an outage counter 1306 initialize an outage counter. Afterwards, the means for incrementing the outage counter 1307 increment the outage counter in each communication round after the initialization of the outage counter.

The apparatus 1300 shown in FIG. 13 furthermore comprises means for notifying an application 1308 that notify an application that uses said communication network of the fault of the certain node, if the outage counter reaches a predetermined threshold value. The means for resetting the outage counter 1309 reset the outage counter to 0, when it is detected that the certain node was working correctly in a predetermined number of previous communication rounds.

FIG. 14 illustrates the behaviour of the outage counter in comparison to the failure counter using three scenarios. The outage counter as well as the failure counter may be handled based on the first and second node status data, not only the first node status data. However, in the following discussion of FIG. 14, for simplicity, a fault of the receiver is not considered, which means that a fault is assumed to be caused by the sending node. In the first line of each scenario (cases 1 to 3), a circle means that a message is correctly received in the communication round, i.e., the sending node is working correctly, while a cross means that the message is not correctly received, i.e., the sending node is faulty. The threshold value is 4 for the outage counter as well as the failure counter and the predetermined number of correct previous communication rounds for the resetting is 3, which means that the outage counter increases until three consecutive messages are correctly received, because the application program is assumed to calculate the output result based on the three last received messages. This is often assumed by application programs executing a control logic.

Case 1 illustrates a permanent fault. The permanent fault is present beginning in round 4. In this scenario, the failure counter and the outage counter have the same behaviour, namely the failure counter and the outage counter are incremented by 1 in each round beginning in round 4. In round 7, the counters reach the threshold value of 4 and the application is notified.

Case 2 illustrates what happens when the fault is intermittent, which means that the fault is present in some rounds and vanishes in other rounds without being completely overcome. The first fault occurs in round 4. The outage counter starts being incremented and reaches the threshold value in round 7, such that the application is notified in round 7. By contrast, the failure counter is incremented only in the rounds in which the intermittent fault occurs. Since the intermittent fault occurs in rounds 4, 5, 8 and 11, it takes until round 11 before the application is notified based on the failure counter. This means that in the case of intermittent faults, the outage counter potentially leads to a faster notification of the application.

Case 3 illustrates a scenario where the intermittent faults occur infrequently, such that the outage counter is reset. The first fault happens in round 3, in which the failure counter and the outage counter are incremented by 1. The outage counter is incremented in each round until round 6, in which the system notices that the node was working correctly in the last three rounds, such that the outage counter is reset to zero. The next fault occurs in round 7, where the failure counter is incremented to 2 and the outage counter is initialized to 1 again and starts counting. The outage counter is incremented and reaches the value 4 in round 10 without being reset again, since until round 10 no period of three rounds in which the node worked correctly occurs. This means that in round 10, the application is notified based on the outage counter. In the same round, the failure counter is still at the value 3. Since the next fault occurs not earlier than in round 18, based on the failure counter, the notification of the application is performed in round 18.
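The three cases can be replayed with the following sketch, which returns the rounds in which the failure counter and the outage counter first reach the threshold. The function name and the representation of a scenario as a set of faulty rounds are illustrative assumptions. For case 3 the text does not name the round of the third fault; round 9 is assumed here, which is consistent with the failure counter being at 3 in round 10.

```python
from typing import Optional


def simulate(fault_rounds: set[int], rounds: int, threshold: int = 4,
             reset_after: int = 3) -> tuple[Optional[int], Optional[int]]:
    """Return (round of failure-counter notification, round of outage-counter notification)."""
    failure = outage = 0
    ok_streak = 0                       # consecutive correctly received rounds
    failure_round = outage_round = None
    for r in range(1, rounds + 1):
        faulty = r in fault_rounds
        ok_streak = 0 if faulty else ok_streak + 1
        if faulty:
            failure += 1                # failure counter: counts faulty rounds only
        if outage > 0:
            if ok_streak >= reset_after:
                outage = 0              # node was correct long enough: reset
            else:
                outage += 1             # keeps counting every round once started
        elif faulty:
            outage = 1                  # (re-)initialize on a new fault
        if failure_round is None and failure >= threshold:
            failure_round = r
        if outage_round is None and outage >= threshold:
            outage_round = r
    return failure_round, outage_round


if __name__ == "__main__":
    print(simulate(set(range(4, 11)), 10))   # case 1: (7, 7)
    print(simulate({4, 5, 8, 11}, 12))       # case 2: (11, 7)
    print(simulate({3, 7, 9, 18}, 18))       # case 3: (18, 10), round 9 assumed
```

The results reproduce the notification rounds discussed for FIG. 14: both counters notify in round 7 for the permanent fault, while for the intermittent faults the outage counter notifies in rounds 7 and 10 and the failure counter only in rounds 11 and 18.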

The explanations of the embodiments and the drawings are to be understood in an illustrative rather than in a restrictive sense. It is evident that various modifications and changes may be made thereto without departing from the scope of the invention as set forth in the claims. It is possible to combine the features described in the embodiments in a modified way for providing additional embodiments that are optimized for a certain usage scenario. As far as such modifications are readily apparent to a person skilled in the art, these modifications shall be regarded as disclosed by the above described embodiments.

1. Method for monitoring failure status of nodes of a communication network, the method comprising: dividing the communication network into clusters of nodes; determining node status data regarding presence or absence of a failure at each node of the communication network by causing each node to diagnose its own status and the status of the other nodes of the communication network; sending first node status data relating to the nodes of the cluster of the determining node from the determining node to the other nodes of the communication network; each node receiving from the other nodes of the communication network, second node status data relating to presence or absence of a failure at the nodes of the cluster of the sending node; and determining node status evaluation data for identifying a particular failed node or nodes of the communication network by majority voting, based on the determined first node status data and the second node status data received from the other nodes of the communication network.
2. Method for monitoring a status of nodes of a communication network, the method comprising: determining first node status data at a first node by causing said first node to diagnose its own status and the status of at least one second node; sending the first node status data to at least one other node; receiving second node status data from at least one other node, said sending and receiving being performed periodically in communication rounds; determining node status evaluation data for the first node based on the determined first node status data and the received second node status data; and determining that a certain node is faulty by majority vote, based on the node status evaluation data; wherein said method further comprises, initializing an outage counter, when the certain node is detected as faulty after a predetermined number of communication rounds in which the certain node was working correctly; and incrementing the outage counter in each communication round in which the certain node remains faulty after the initialization of the outage counter.
3. Method according to claim 2, further comprising notifying an application that uses said communication network of a fault of the certain node if the outage counter reaches a predetermined threshold value.
4. Method according to claim 1, wherein the nodes are in-vehicle devices of a vehicle control system.
5. Method according to claim 1, wherein the status of a node is either correct or fault.
6. Apparatus for monitoring failure status of nodes of a communication network, the apparatus comprising: means for dividing the communication network into clusters of nodes; means for determining node status data regarding presence or absence of a failure at each node of the communication network by causing each node to diagnose its own status and the status of the other nodes of the communication network; means for sending first node status data relating to the nodes of a cluster of the determining node from the determining node to the other nodes of the communication network; means for receiving, from the other nodes of the communication network, second node status data relating to a presence or absence of a failure at the nodes of the cluster of the sending node; and means for determining node status evaluation data for identifying a failed node or nodes of the communication network by majority voting, based on the determined first node status data and the second node status data received from the other nodes of the communication network.
7. Apparatus for monitoring a status of nodes of a communication network, the apparatus comprising: means for determining first node status data at a first node by causing said first node to diagnose its own status and the status of at least one second node; means for sending the first node status data to at least one other node; means for receiving second node status data from at least one other node, said sending and receiving being performed periodically in communication rounds; means for determining node status evaluation data for the first node based on the determined first node status data and the received second node status data; means for determining that a certain node is faulty by majority vote, based on the node status evaluation data; wherein said apparatus further comprises, means for initializing an outage counter, when the certain node is detected as faulty after a predetermined number of communication rounds in which the certain node is working correctly; and means for incrementing the outage counter in each communication round in which the node remains faulty after the initialization of the outage counter.
8. Apparatus according to claim 7, further comprising means for notifying an application that uses said communication network of a fault of the certain node if the outage counter reaches a predetermined threshold value.
9. Apparatus according to claim 6, wherein the nodes are in-vehicle devices of a vehicle control system.
10. Apparatus according to claim 6, wherein the status of a node is either correct or fault.
11. A computer program product, the computer program product comprising a non-transitory computer readable medium and a computer program recorded therein in form of a series of state elements corresponding to instructions which are adapted to be processed by a data processing means of a data processing apparatus such that a method according to claim 1 is carried out.
12. A computer program product, the computer program product comprising a non-transitory computer readable medium encoded with a computer program that includes instructions which, when loaded into a data processor, cause the data processor to be configured to perform the method of claim 1.